Introduction

Row

Overview

For this project, we will follow the DCOVAC process. The process is listed below:

DCOVAC – THE DATA MODELING FRAMEWORK

  • DEFINE the Problem
  • COLLECT the Data from Appropriate Sources
  • ORGANIZE the Data Collected
  • VISUALIZE the Data by Developing Charts
  • ANALYZE the data with Appropriate Statistical Methods
  • COMMUNICATE your Results

Row

The Problem & Data Collection

The Problem

  • Problem Description: The Car Crash 2019 data used in this dashboard covers car crashes in the United States during 2019. We will examine the variables in the dataset to determine which ones help predict the number of injuries in a car crash and whether alcohol was involved.

The Questions

  • Analysis Questions
  1. Can the day of the week predict whether or not there was alcohol present in the crash?
  2. Does the number of vehicles predict the number of injuries?
  3. What region has the most crashes?
  4. Which Variables best predict the number of injuries?

The Data

This dataset has 1345 rows and 16 variables. For this analysis, we drop the categorical name columns that duplicate a numeric code (such as MONTHNAME and REGIONNAME) and keep REL_ROADNAME, LGT_CONDNAME, and WEATHER1NAME.

Description of the Variables in the Dataset

  • VARIABLES TO PREDICT WITH:

  • URBANICITY: where the accident took place (1=Rural Area, 2=Urban Area)

  • REGION: region where the accident took place (1=Northeast, 2=Midwest, 3=South, 4=West)

  • VE_TOTAL: total number of vehicles involved in the accident

  • PEDS: total number of pedestrians involved in the accident

  • MONTH: month in which the accident took place

  • DAY_WEEK: day of the week the accident took place (1=Sunday)

  • HOUR: hour of the day the accident took place (24-hour clock)

  • MAX_SEV: maximum injury severity in the accident (0 = No Apparent Injury, 1 = Possible Injury, 2 = Suspected Minor Injury, 3 = Suspected Serious Injury)

  • MAN_COLL: manner of collision, a categorical code describing how the vehicles collided

  • WRK_ZONE: whether the accident occurred in a work zone (0=NO, 1=YES)

  • VARIABLES WE WANT TO PREDICT

  • ALCOHOL: whether alcohol was present in the accident (1=YES, 2=NO)

  • NUM_INJ: total number of injuries resulting from the accident

Summary Statistics

Column

Summary Statistics

     REGION        URBANICITY       VE_TOTAL          PEDS        
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :0.00000  
 1st Qu.:3.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.00000  
 Median :3.000   Median :1.000   Median :2.000   Median :0.00000  
 Mean   :2.949   Mean   :1.212   Mean   :1.845   Mean   :0.07138  
 3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:0.00000  
 Max.   :4.000   Max.   :2.000   Max.   :5.000   Max.   :2.00000  
    NUM_INJ           MONTH           DAY_WEEK          HOUR      
 Min.   :0.0000   Min.   : 1.000   Min.   :1.000   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 4.000   1st Qu.:3.000   1st Qu.:10.00  
 Median :1.0000   Median : 7.000   Median :4.000   Median :14.00  
 Mean   :0.7903   Mean   : 6.768   Mean   :4.109   Mean   :13.74  
 3rd Qu.:1.0000   3rd Qu.:10.000   3rd Qu.:6.000   3rd Qu.:17.00  
 Max.   :8.0000   Max.   :12.000   Max.   :7.000   Max.   :99.00  
    ALCOHOL         MAX_SEV          MAN_COLL         WRK_ZONE      
 Min.   :1.000   Min.   :0.0000   Min.   : 0.000   Min.   :0.00000  
 1st Qu.:2.000   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.00000  
 Median :2.000   Median :1.0000   Median : 1.000   Median :0.00000  
 Mean   :1.928   Mean   :0.9398   Mean   : 2.938   Mean   :0.02825  
 3rd Qu.:2.000   3rd Qu.:2.0000   3rd Qu.: 6.000   3rd Qu.:0.00000  
 Max.   :2.000   Max.   :5.0000   Max.   :98.000   Max.   :4.00000  
 REL_ROADNAME       LGT_CONDNAME       WEATHER1NAME      
 Length:1345        Length:1345        Length:1345       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         

Column

Transform Variables

ALCOHOL (Alcohol Involved?)

# A tibble: 2 × 2
  ALCOHOL     n
  <chr>   <int>
1 1          97
2 2        1248

NUM_INJ (Distribution of Injuries)

Data Viz #1

Column

Response Variables

ALCOHOL YES(1)/NO(2)

About 93% of the crashes in the data involved no alcohol. Among the potential predictors, ALCOHOL has its strongest relationships with REGION, MAX_SEV, VE_TOTAL, and NUM_INJ.

Column

Transform Variables

Data Viz #2

Column

Response Variables

NUM_INJ

The largest concentration of values is at 0-1 injuries, and most crashes fall between 0 and 2 injuries because a car accident usually results in no injuries or only a few. The distribution is skewed to the right, with a long tail reaching 8 injuries. Among the potential predictors, NUM_INJ has its strongest relationships with PEDS, MAX_SEV, and VE_TOTAL.

Column

Transform Variables

NUM_INJ Analysis

Row

Predict Number of Injuries

For this analysis we will use a Linear Regression Model.

Adjusted R-Squared

51 %

RMSE

0.7

Row

Regression Output

Term                                      Estimate  Std. Error  t value  Pr(>|t|)
MAX_SEV                                      0.640       0.019   34.458     0.000
VE_TOTAL                                     0.320       0.043    7.472     0.000
WEATHER1NAMEReported as Unknown             -1.571       0.510   -3.084     0.002
WEATHER1NAMEFog, Smog, Smoke                -0.449       0.242   -1.858     0.063
WRK_ZONE2                                   -1.259       0.707   -1.781     0.075
HOUR                                         0.005       0.003    1.775     0.076
URBANICITYDusk                              -0.235       0.137   -1.709     0.088
WEATHER1NAMEOther                            0.805       0.500    1.611     0.108
REGION                                       0.040       0.026    1.525     0.127
WRK_ZONE1                                    0.355       0.238    1.490     0.137
URBANICITYDaylight                          -0.072       0.054   -1.341     0.180
WEATHER1NAMERain                             0.078       0.064    1.229     0.219
REL_ROADNAMEIn Parking Lane/Zone            -0.780       0.733   -1.064     0.288
REL_ROADNAMEOutside Trafficway              -0.696       0.751   -0.927     0.354
URBANICITYDawn                              -0.151       0.165   -0.917     0.359
MAN_COLL                                    -0.003       0.003   -0.828     0.408
PEDS                                         0.070       0.085    0.823     0.411
REL_ROADNAMEOn Roadside                     -0.558       0.725   -0.770     0.442
REL_ROADNAMEOn Roadway                      -0.547       0.719   -0.760     0.448
REL_ROADNAMEOn Shoulder                     -0.558       0.755   -0.739     0.460
REL_ROADNAMEOff Roadway-Location Unknown    -0.608       0.879   -0.692     0.489
REL_ROADNAMEGore                            -0.674       1.015   -0.664     0.507
MONTH                                        0.003       0.006    0.520     0.603
URBANICITYDark - Unknown Lighting            0.142       0.276    0.513     0.608
URBANICITYOther                             -0.337       0.708   -0.476     0.634
WRK_ZONE4                                   -0.104       0.289   -0.360     0.719
WEATHER1NAMENot Reported                    -0.025       0.084   -0.301     0.764
URBANICITYNot Reported                      -0.063       0.235   -0.269     0.788
WEATHER1NAMECloudy                          -0.013       0.061   -0.208     0.836
WRK_ZONE3                                    0.128       0.706    0.181     0.856
DAY_WEEK                                     0.002       0.010    0.179     0.858
URBANICITYDark - Not Lighted                 0.011       0.080    0.135     0.892
REL_ROADNAMEOn Median                       -0.070       0.735   -0.095     0.925
WEATHER1NAMESleet or Hail                    0.049       0.719    0.068     0.946
WEATHER1NAMESnow                             0.007       0.170    0.043     0.966
(Intercept)                                 -0.013       0.737   -0.018     0.986

Residual Assumptions Explorations

Row

Analysis Summary

After examining this model, we determine that there are some predictors that are not important in predicting the number of injuries, so a pruned version of the model is created by removing predictors that are not significant.

Row

Predict total number of injuries Final Version

For this analysis we will use a pruned Linear Regression Model. We removed URBANICITY, PEDS, MONTH, DAY_WEEK, HOUR, WRK_ZONE, REL_ROADNAME, and LGT_CONDNAME.

Adjusted R-Squared

51 %

RMSE

0.71

Row

Regression Output

Term                             Estimate  Std. Error  t value  Pr(>|t|)
MAX_SEV                             0.642       0.018   36.493     0.000
VE_TOTAL                            0.289       0.033    8.867     0.000
(Intercept)                        -0.489       0.099   -4.965     0.000
WEATHER1NAMEReported as Unknown    -1.418       0.506   -2.803     0.005
REGION                              0.050       0.026    1.927     0.054
WEATHER1NAMEFog, Smog, Smoke       -0.408       0.237   -1.723     0.085
WEATHER1NAMERain                    0.098       0.063    1.563     0.118
WEATHER1NAMEOther                   0.776       0.500    1.551     0.121
MAN_COLL                           -0.003       0.003   -0.933     0.351
WEATHER1NAMENot Reported           -0.042       0.081   -0.515     0.607
WEATHER1NAMESleet or Hail          -0.189       0.707   -0.267     0.789
WEATHER1NAMESnow                    0.030       0.170    0.179     0.858
WEATHER1NAMECloudy                 -0.002       0.061   -0.035     0.972

Residual Assumptions Explorations

Row

Analysis Summary

After examining the residual plots for this model, we can see some issues with the fit. The residuals vs. fitted plot shows clear patterns, which means the model is unable to capture all of the systematic variation in the data. There also appear to be outliers at the top of the residual plot. The Q-Q plot is noticeably curved: the points fall on the line only near the middle, and both ends curve away from it. This means the residuals deviate from a normal distribution, with tails that are heavier or lighter than normality would predict.

Removing the predictors that did not help predict the number of injuries resulting from a car crash did not have a big impact on our fit statistics: adjusted R-squared stayed at 51% and RMSE moved from 0.70 to 0.71.

ALCOHOL Analysis

Row


Predict Alcohol Involvement

Regression Tree

Neural Nets

Bootstrap Forest

Row

Regression/Estimation Model Comparison

  • Overall

The model comparison shows that the best model for predicting whether alcohol was involved in the crash is the neural network. It has a higher overall R-squared than the other models and a lower RMSE. Although its accuracy is not very high and its error is not very low, it is the best of the 4 models we are comparing.

Conclusion

Summary

In conclusion, our predictors do not predict very well either whether alcohol was involved in a car accident or the number of injuries resulting from it.

Can the day of the week predict whether or not there was alcohol present in the crash? The day of the week has an influence on whether alcohol was involved. Saturday and Sunday have the highest number of car crashes with alcohol involved. This makes sense because drinking is more common on weekends than on weekdays.

Does the number of vehicles predict the number of injuries? Yes. The model predicts more injuries when more vehicles are involved in the crash: in the final model, each additional vehicle adds about 0.29 predicted injuries, holding the other predictors fixed.

What region has the most crashes? The South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA, FL, AL, MS, LA, AR, OK, TX) has the most crashes at about 60% of the total, with about 6.3% being alcohol-related crashes.

Which Variables best predict the number of injuries? The best predictors of the total number of injuries were the maximum injury severity, the number of vehicles, and the region in which the accident took place.

---
title: "Nationwide Car Crash 2019 Report"
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    source_code: embed
---

```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```

```{r load_data}
CCREPORT <- read_csv("AccidentReport2019.csv")
```

Introduction {data-orientation=rows}
=======================================================================

Row {data-height=250}
-----------------------------------------------------------------------

### Overview 

For this project, we will follow the DCOVAC process. The process is listed below:

DCOVAC – THE DATA MODELING FRAMEWORK

* DEFINE the Problem
* COLLECT the Data from Appropriate Sources
* ORGANIZE the Data Collected
* VISUALIZE the Data by Developing Charts
* ANALYZE the data with Appropriate Statistical Methods
* COMMUNICATE your Results

Row {data-height=650}
-----------------------------------------------------------------------

### The Problem & Data Collection

#### The Problem
* ***Problem Description***
The Car Crash 2019 data used in this dashboard covers car crashes in the United States during 2019. We will examine the variables in the dataset to determine which ones help predict the number of injuries in a car crash and whether alcohol was involved.

### The Questions
* ***Analysis Questions***
1. Can the day of the week predict whether or not there was alcohol present in the crash?
2. Does the number of vehicles predict the number of injuries?
3. What region has the most crashes?
4. Which Variables best predict the number of injuries?

#### The Data
This dataset has 1345 rows and 16 variables. For this analysis, we drop the categorical name columns that duplicate a numeric code (such as MONTHNAME and REGIONNAME) and keep REL_ROADNAME, LGT_CONDNAME, and WEATHER1NAME.

#### Data Sources
https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/

### Description of the Variables in the Dataset
* ***VARIABLES TO PREDICT WITH:***

* *URBANICITY*: where the accident took place (1=Rural Area, 2=Urban Area)
* *REGION*: region where the accident took place (1=Northeast, 2=Midwest, 3=South, 4=West)
* *VE_TOTAL*: total number of vehicles involved in the accident
* *PEDS*: total number of pedestrians involved in the accident
* *MONTH*: month in which the accident took place
* *DAY_WEEK*: day of the week the accident took place (1=Sunday)
* *HOUR*: hour of the day the accident took place (24-hour clock)
* *MAX_SEV*: maximum injury severity in the accident (0 = No Apparent Injury, 1 = Possible Injury, 2 = Suspected Minor Injury, 3 = Suspected Serious Injury)
* *MAN_COLL*: manner of collision, a categorical code describing how the vehicles collided
* *WRK_ZONE*: whether the accident occurred in a work zone (0=NO, 1=YES)

* ***VARIABLES WE WANT TO PREDICT***

* *ALCOHOL*: whether alcohol was present in the accident (1=YES, 2=NO)
* *NUM_INJ*: total number of injuries resulting from the accident
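
The numeric codes listed above can be turned into labeled factors so that tables and plots show names instead of numbers. The following is a minimal sketch, assuming a small hypothetical data frame called `crashes` that stands in for the real `CCREPORT` data; only the REGION and ALCOHOL codings given in the list are used.

```r
# Hypothetical sample of coded values; the real columns live in CCREPORT.
crashes <- data.frame(
  REGION  = c(3, 1, 4, 3, 2),
  ALCOHOL = c(2, 2, 1, 2, 2)
)

# Attach the labels from the variable descriptions to the numeric codes.
crashes$REGION <- factor(crashes$REGION, levels = 1:4,
                         labels = c("Northeast", "Midwest", "South", "West"))
crashes$ALCOHOL <- factor(crashes$ALCOHOL, levels = 1:2,
                          labels = c("Yes", "No"))

# Counts now print with readable region names.
region_counts <- table(crashes$REGION)
print(region_counts)
```

Labeling the levels explicitly also guards against misreading a code such as ALCOHOL = 2 as "yes".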

Summary Statistics
=======================================================================


Column {data-width=650}
-------------------------------------------------------------------
### Summary Statistics
```{r, cache=TRUE}
#the cache=TRUE can be removed. This will allow you to rerun your code without it having to run EVERYTHING from scratch every time. If the output seems to not reflect new updates, you can choose Knit, Clear Knitr cache to fix.
#View data
#drop the *NAME label columns that duplicate a numeric code, plus YEAR, HARM_EV and HARM_EVNAME
CCREPORT <- select(CCREPORT,-MONTHNAME,-REGIONNAME,-URBANICITYNAME,-DAY_WEEKNAME,-MAX_SEVNAME,-ALCOHOLNAME,-HARM_EVNAME,-YEAR,-HARM_EV)
summary(CCREPORT)
```


Column {data-width=300}
-----------------------------------------------------------------------
### Transform Variables

```{r, cache=TRUE}
# Convert the coded and text categorical columns to factors, each under its own name.
# MONTH, DAY_WEEK, HOUR, and MAX_SEV stay numeric, matching how the regression output treats them.
CCREPORT <- mutate(CCREPORT,WRK_ZONE=as.factor(WRK_ZONE),
             URBANICITY=as.factor(URBANICITY),
             ALCOHOL=as.factor(ALCOHOL),
             WEATHER1NAME=as.factor(WEATHER1NAME),
             REL_ROADNAME=as.factor(REL_ROADNAME),
             LGT_CONDNAME=as.factor(LGT_CONDNAME))
```
#### ALCOHOL (Alcohol Involved?)
```{r, cache=TRUE}
tibble::as_tibble(select(CCREPORT,ALCOHOL) %>%
  table())
```

#### NUM_INJ (Distribution of Injuries)

<!--Instructions to import .jpg or .png images
use getwd() to see current path structure 
copy file into same place as .Rmd file
put the path to this file in the link
format: ![Alt text](book.jpg) -->

![](NumInjDist.png)


Data Viz #1
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### ALCOHOL YES(1)/NO(2)
```{r, cache=TRUE}
as_tibble(select(CCREPORT,ALCOHOL) %>%
         table()) %>%
  ggplot(aes(y=n,x=ALCOHOL)) + geom_bar(stat="identity")
```

About 93% of the crashes in the data involved no alcohol. Among the potential predictors, ALCOHOL has its strongest relationships with REGION, MAX_SEV, VE_TOTAL, and NUM_INJ.
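
The 93% figure can be reproduced from the ALCOHOL counts shown on the Summary Statistics page (97 crashes coded 1 and 1248 coded 2). This is a small sketch of that arithmetic; the vector name `alcohol_counts` is only illustrative.

```r
# Counts taken from the ALCOHOL table on the Summary Statistics page.
alcohol_counts <- c(Yes = 97, No = 1248)

# Convert the counts to percentages of all 1345 crashes.
alcohol_share <- round(100 * prop.table(alcohol_counts), 1)
print(alcohol_share)
```

With a split this uneven, a model that always answers "no alcohol" would already be right about 93% of the time, which is worth keeping in mind when judging the ALCOHOL models.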


Column {data-width=500}
-----------------------------------------------------------------------

### Transform Variables

```{r, cache=TRUE}
ggpairs(select(CCREPORT,ALCOHOL,NUM_INJ,REGION,URBANICITY,VE_TOTAL,PEDS,MONTH))
```


Data Viz #2
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables

#### NUM_INJ
```{r, cache=TRUE}
ggplot(CCREPORT, aes(x = NUM_INJ)) +
  geom_bar()
```

The largest concentration of values is at 0-1 injuries, and most crashes fall between 0 and 2 injuries because a car accident usually results in no injuries or only a few. The distribution is skewed to the right, with a long tail reaching 8 injuries. Among the potential predictors, NUM_INJ has its strongest relationships with PEDS, MAX_SEV, and VE_TOTAL.


Column {data-width=500}
-----------------------------------------------------------------------

### Transform Variables
```{r, cache=TRUE}
ggpairs(select(CCREPORT,ALCOHOL,NUM_INJ,DAY_WEEK,HOUR,MAX_SEV,MAN_COLL))
```


NUM_INJ Analysis {data-orientation=rows}
=======================================================================

Row
-----------------------------------------------------------------------

### Predict Number of Injuries
For this analysis we will use a Linear Regression Model.

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
NUM_INJ_lm <- lm(NUM_INJ ~ . -ALCOHOL,data = CCREPORT)
summary(NUM_INJ_lm)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(NUM_INJ_lm)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(NUM_INJ_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE

```{r, cache=TRUE}
Sig<-round(summary(NUM_INJ_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```

Row
-----------------------------------------------------------------------

### Regression Output

```{r,include=FALSE, cache=TRUE}
#knitr::kable(summary(NUM_INJ_lm)$coef, digits = 3) #pretty table output
summary(NUM_INJ_lm)$coef
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(NUM_INJ_lm))[,4])  
out <- coef(summary(NUM_INJ_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```

### Residual Assumptions Explorations

```{r, cache=TRUE}
plot(NUM_INJ_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```

Row
-----------------------------------------------------------------------

### Analysis Summary
After examining this model, we determine that there are some predictors that are not important in predicting the number of injuries, so a pruned version of the model is created by removing predictors that are not significant.
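
One hedged way to check that pruning costs little is to compare the full and reduced fits directly. The sketch below uses R's built-in `mtcars` data as a stand-in, since the crash CSV is not bundled here; the model formulas and object names are assumptions for illustration, not the dashboard's actual models.

```r
# mtcars ships with R and stands in for the crash data here.
full_fit   <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
pruned_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Adjusted R-squared penalizes extra predictors, so it is a fair
# yardstick for comparing models of different sizes.
adj_r2 <- c(full   = summary(full_fit)$adj.r.squared,
            pruned = summary(pruned_fit)$adj.r.squared)
print(round(adj_r2, 3))

# Partial F-test on the nested models: a large p-value suggests the
# dropped terms added little explanatory power.
comparison <- anova(pruned_fit, full_fit)
print(comparison)
```

The same pattern would apply to `NUM_INJ_lm` if the full and pruned fits were stored under different names instead of overwriting one object.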

Row
-----------------------------------------------------------------------

### Predict total number of injuries Final Version
For this analysis we will use a pruned Linear Regression Model. We removed URBANICITY, PEDS, MONTH, DAY_WEEK, HOUR, WRK_ZONE, REL_ROADNAME, and LGT_CONDNAME.

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
NUM_INJ_lm <- lm(NUM_INJ ~ . -ALCOHOL -URBANICITY -PEDS -MONTH -DAY_WEEK -HOUR -WRK_ZONE - REL_ROADNAME -LGT_CONDNAME,data = CCREPORT)
summary(NUM_INJ_lm)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(NUM_INJ_lm)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(NUM_INJ_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE

```{r, cache=TRUE}
Sig<-round(summary(NUM_INJ_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```

Row
-----------------------------------------------------------------------

### Regression Output

```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(NUM_INJ_lm)$coef, digits = 3) #pretty table output
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(NUM_INJ_lm))[,4])  
out <- coef(summary(NUM_INJ_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```

### Residual Assumptions Explorations

```{r, cache=TRUE}
plot(NUM_INJ_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```


Row
-----------------------------------------------------------------------

### Analysis Summary
After examining the residual plots for this model, we can see some issues with the fit. The residuals vs. fitted plot shows clear patterns, which means the model is unable to capture all of the systematic variation in the data. There also appear to be outliers at the top of the residual plot. The Q-Q plot is noticeably curved: the points fall on the line only near the middle, and both ends curve away from it. This means the residuals deviate from a normal distribution, with tails that are heavier or lighter than normality would predict.

Removing the predictors that did not help predict the number of injuries resulting from a car crash did not have a big impact on our fit statistics: adjusted R-squared stayed at 51% and RMSE moved from 0.70 to 0.71.
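
The RMSE boxes on this page are filled from `summary(NUM_INJ_lm)$sigma`, which R reports as the residual standard error. It divides the residual sum of squares by the residual degrees of freedom rather than by the number of rows, so it is slightly larger than the plain root mean squared residual. The sketch below shows both quantities on the built-in `mtcars` data, used here only as a stand-in; the object names are illustrative.

```r
# Stand-in model on a dataset that ships with R.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Residual standard error: sqrt(RSS / (n - p)), as reported by summary().
residual_se <- summary(fit)$sigma

# Root mean squared residual: sqrt(RSS / n).
rmse <- sqrt(mean(res^2))

print(c(residual_se = residual_se, rmse = rmse))
```

With 1345 rows and a few dozen coefficients, the two values for the crash model would likely differ only in the second decimal place, but naming the statistic precisely avoids confusion when comparing against other tools.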







ALCOHOL Analysis {data-orientation=rows}
=======================================================================

Row {data-height=900}
-----------------------------------------------------------------------

### Predict Alcohol Involvement
![](ALCOHOL.png)
   
### Regression Tree
![](RegressionTree.png)



### Neural Nets
![](Neural.png)

### Bootstrap Forest
![](BootStrap.png)

Row 
-------------------------------------
    
### Regression/Estimation Model Comparison
![](RegressionModelComparison.png)
* ***Overall***

The model comparison shows that the best model for predicting whether alcohol was involved in the crash is the neural network. It has a higher overall R-squared than the other models and a lower RMSE. Although its accuracy is not very high and its error is not very low, it is the best of the 4 models we are comparing.

Conclusion
=======================================================================
### Summary


In conclusion, our predictors do not predict very well either whether alcohol was involved in a car accident or the number of injuries resulting from it.

**Can the day of the week predict whether or not there was alcohol present in the crash?**
The day of the week has an influence on whether alcohol was involved. Saturday and Sunday have the highest number of car crashes with alcohol involved. This makes sense because drinking is more common on weekends than on weekdays.
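
A cross-tabulation is one simple way to check a day-of-week claim like this. The sketch below uses a handful of made-up rows (not the real crash data) with the dashboard's coding, DAY_WEEK 1 = Sunday and ALCOHOL 1 = yes, 2 = no; it assumes the days are numbered consecutively so that 7 is Saturday.

```r
# Hypothetical rows for illustration only; the real data is in CCREPORT.
crashes <- data.frame(
  DAY_WEEK = c(1, 1, 1, 4, 4, 4, 4, 7, 7, 7),
  ALCOHOL  = c(1, 2, 1, 2, 2, 2, 2, 1, 2, 2)
)

# Count crashes by day and alcohol code.
by_day <- table(crashes$DAY_WEEK, crashes$ALCOHOL,
                dnn = c("DAY_WEEK", "ALCOHOL"))

# Share of each day's crashes that involved alcohol (column "1" = yes).
alcohol_rate <- prop.table(by_day, margin = 1)[, "1"]
print(round(alcohol_rate, 2))
```

Looking at the rate per day rather than the raw count separates "more alcohol crashes" from "more crashes overall" on busy days.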

**Does the number of vehicles predict the number of injuries?**
Yes. The model predicts more injuries when more vehicles are involved in the crash: in the final model, each additional vehicle adds about 0.29 predicted injuries, holding the other predictors fixed.

**What region has the most crashes?**
The South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA, FL, AL, MS, LA, AR, OK, TX) has the most crashes at about 60% of the total, with about 6.3% being alcohol-related crashes.

**Which Variables best predict the number of injuries?**
The best predictors of the total number of injuries were the maximum injury severity, the number of vehicles, and the region in which the accident took place.